Diversification of transcription-associated protein (TAP) families during land plant evolution is a key process yielding 
increased complexity of plant life. Understanding the evolutionary relationships between these genes is crucial to gain insight 
into plant evolution. We have determined a substantial set of TAPs that are focused on, but not limited to, land plants using 
PSI-BLAST searches and subsequent filtering and clustering steps. Phylogenies were created in an automated way using a 
combination of distance and maximum likelihood methods. Comparison of the data to previously published work confirmed 
their accuracy and usefulness for the majority of gene families. Evidence is presented that the flowering plant apical stem cell 
regulator WUSCHEL evolved from an ancestral homeobox gene that was already present after the water-to-land transition. 
The presence of distinct expanded gene families, such as COP1 and HIT in moss, is discussed within the evolutionary 
backdrop. Comparative analyses revealed that almost all angiosperm transcription factor families were already present in the 
earliest land plants, whereas many are missing among unicellular algae. A global analysis not only of transcription factors but 
also of transcriptional regulators and novel putative families is presented. A wealth of data about plant TAP families and all 
data accrued throughout their automated detection and analysis are made available via the PlanTAPDB Web interface. 
Evolutionary relationships of these genes are readily accessible to the nonexpert at a mouse-click. Initial analyses of selected 
gene families revealed that PlanTAPDB can easily be used for knowledge discovery. 



The coordinated expression control of the entirety of 
genes in a given cell determines its physiological state, 
morphology, and identity in the organism. Reprogram- 
ming the set of transcribed genes during development 
or physiological adaptation requires modulated ac- 
tivation and deactivation of regulatory factors. In 
eukaryotes, the transcription of protein-coding genes 
is controlled by complex networks of transcription- 
associated proteins (TAPs). Specific transcription factors 
(TFs) activate or repress transcription of their target 
genes by binding to cis-active elements. Further tran- 
scriptional regulators (TRs) include the following: (1) 
coactivators and corepressors, which bind and influ- 
ence TFs; (2) general transcription initiation factors, 
which recognize core promoter elements and recruit 
components of the basal transcription machinery; and 
(3) chromatin remodeling factors, which affect the 
accessibility of DNA through histone modifications 
and DNA methylation. The modular nature of TFs, 
possessing DNA-binding and protein-protein interaction
domains, facilitates the high diversity of transcriptional
regulation. 

Changes in transcriptional regulation enhance com- 
plexity at the genetic level and thus can generate novel 
signal transduction pathways. Such changes, medi- 
ated by recombined complexes of regulatory proteins 
as well as by altered regulatory sequence elements, 
were repeatedly proposed to be a major driving force 
of evolution (Doebley and Lukens, 1998; Tautz, 2000; 
Hsia and McGinnis, 2003; Levine and Tjian, 2003; 
Gutierrez et al., 2004; Carroll, 2005). Previous studies 
have shown that TAPs are highly specific across pro- 
karyotic and eukaryotic lineages and that their diver- 
sity appears to be linked to their phylogenetic distance 
(Coulson et al., 2001; Coulson and Ouzounis, 2003). In 
eukaryotes, key players of the basal transcription 
machinery are highly conserved, whereas many fami- 
lies of DNA-binding TFs are taxon specific and show 
substantial sequence diversity (Coulson and Ouzounis, 
2003). Moreover, the size and genomic fraction of TF 
families seem to correlate with cellular complexity 
(Levine and Tjian, 2003). 

The evolution of eukaryotic TF genes involves the 
processes of specific amplification of common families 
through duplication and diversification, as well as the 
shuffling of functional domains, resulting in lineage- 
specific families that can facilitate novel networks of 
protein-protein interactions and can take over new 
functions. In plants, the evolution and expansion of 
specific gene families seem to be more pronounced than 
in other eukaryotes (Lespinet et al., 2002). In Arabidopsis
(Arabidopsis thaliana), genes involved in transcriptional
regulation were preferentially retained following
whole-genome duplications (Blanc and Wolfe, 2004; 
Seoighe and Gehring, 2004). It could be demonstrated 
that TF genes show a higher duplicability as well as 
retention rate in seed plants compared to other crown 
eukaryotes and other plant genes (Shiu et al., 2005), 
which results in considerable lineage-specific expan- 
sion of distinct TF families in plants. Consequently, 
45% of the TF genes in Arabidopsis were found to 
belong to families that are specific to plants (Riechmann 
et al., 2000). Evidence that many plant-specific proteins 
resemble TFs (Gutierrez et al., 2004) further supports 
the assumption that the increase of complexity in 
transcriptional regulation mechanisms has been cru- 
cial for the evolution of plants. 

In recent years, much emphasis was placed on the 
understanding of regulatory networks controlling the 
transcription of genes. Genome-wide comparative 
analyses aid in revealing the evolution of transcrip- 
tional regulation that underlies the diversity of organ- 
isms. TAP genes and transcriptional networks have 
been studied extensively in unicellular organisms (e.g. 
Kyrpides and Ouzounis, 1999; Perez-Rueda et al., 2004; 
Madan Babu et al., 2006), as well as in basal metazoans 
(Satou and Satoh, 2005; Larroux et al., 2006) and crown 
eukaryotes (Messina et al., 2004; Reece-Hoyes et al.,
2005). Within the plant kingdom, only two seed plants, 
Arabidopsis and rice (Oryza sativa), were globally in- 
vestigated (for review, see Qu and Zhu, 2006) and their 
TAP gene families compared to those of the unicellular 
green alga Chlamydomonas reinhardtii, fungi, and metazoans
(Riechmann et al., 2000; Shiu et al., 2005). Little is 
known about TAPs in nonseed plants, like the bryo- 
phyte Physcomitrella patens, and no genome-wide com- 
pendium of its TAP genes is available, as is the case for 
nongreen algae. 

While phylogenetic studies have been carried out for 
single TAP families, e.g. sigma factors, LEAFY (LFY)/ 
FLO, MADS, and AP2 (Ichikawa et al., 2004; Maizel 
et al., 2005; Riese et al., 2005; Shigyo et al., 2006), a large- 
scale phylogenetic analysis of TAP gene families from 
nonseed plants is still lacking. Here, we investigated 
and compared plant TAP gene families on a genome- 
wide scale across species of all three domains of life to 
gain insight into the evolution of transcriptional regu- 
lation in plants. We covered the whole evolutionary 
range from unicellular algae through bryophytes to 
angiosperms by including genomic-scale sequence 
data of the diatom Thalassiosira pseudonana, the red 
alga Cyanidioschyzon merolae, the green alga C. reinhard- 
tii, the moss P. patens, the monocot rice, and the dicot 
Arabidopsis. The moss P. patens diverged from the 
ancestor of extant flowering plants at least 450 million 
years ago (Theissen et al., 2001; Hedges et al., 2004). It 
was chosen as an offset for this study because, in 
comparison with flowering plants, it might enable 
inference of the ancestral state of land plant transcrip- 
tional regulation. A comprehensive analysis of gene 
families can be performed using the large collection of 
clustered expressed sequence tag (EST) data (Rensing 
et al., 2002; Lang et al., 2005). Starting from the complete 
set of P. patens candidate TAP genes, we collected 
homologs using PSI-BLAST and carried out automated 
filtering and clustering procedures, followed by man- 
ual annotation. From the resulting ample pool of TAP 
genes, taxonomic distribution, lineage-specific expan- 
sion, and high-quality phylogenies were inferred. 



RESULTS AND DISCUSSION 

Availability: All resources are available via the 
PlanTAPDB Web interface (http://www.cosmoss.org/bm/plantapdb). 

Compilation of the Query Dataset 

In terms of evolution, mosses are located halfway 
between seed plants and algae and were therefore 
chosen as an offset for the global phylogenetic analysis 
of plant TAPs. In addition, mosses morphologically 
resemble the first plants that occupied the land 
(Kenrick and Crane, 1997). In the moss P. patens, a 
total of 1,592 putative TAPs (PTs) were identified from 
a comprehensive clustered and annotated EST data- 
base (Lang et al., 2005) by two strategies: (1) TBLASTN 
searches with plant and algae reference TAPs com- 
piled by relaxed keyword searches, and (2) motif scans 
for transcription-associated domains. The resulting 
comprehensive set of candidate moss TAPs included 
nearly all TF families known from seed plants 
(http://arabtfdb.bio.uni-potsdam.de/v1.1/, http://ricetfdb.bio.uni-potsdam.de/v2.1/; 
Riechmann et al., 2000; 
Guo et al., 2005; Gao et al., 2006), as well as sequences 
putatively encoding TAPs. False-positive sequences 
introduced by this compilation of queries were later 
removed during the annotation process. To avoid 
potentially fragmentary virtual transcripts, we deter- 
mined the full-length closest homolog for each of the 
moss candidate TAPs to be used subsequently as seed 
query sequence. For a homolog to be considered, its 
BLASTX match needed to be in the same frame as the 
original annotation of the moss candidate transcript 
and its predicted open reading frame (ORF). A closest 
homolog could be determined for about 99% of the 
1,592 P. patens candidate TAPs. For 19 of the candidate 
sequences, no homolog was found, yet 12 of those 
were included into the seed query set because they 
contained a predicted ORF. The complete nonredun- 
dant set of closest homologs used for PSI-BLAST 
searches comprises 1,162 sequences (Fig. 1). This 
seed set contains mainly sequences of plant origin 
(88% Viridiplantae), around 8% of which are derived 
from bryophytes. Besides 5% of patented sequences, 
for which no taxon annotation is available, the re- 
mainder of the sequences are distributed across 
metazoa (3%), bacteria (2%), fungi (1%), lower eukary- 
otes (0.4%), viruses (0.2%), and Archaea (0.1%). The 
usage of PSI-BLAST enables the detection even of 
distant homologs (Schaffer et al., 2001), e.g. from algae, 
fungi, animals, Eubacteria, or Archaea. 



Plant Physiol. Vol. 143, 2007 






Richardt et al. 



Figure 1. Flowchart of TreePipe and PlanTAPDB. Legible labels from the 
flowchart: 1,162 TAP seed queries; PSI-BLAST; 144,941 distinct hits. 
(See online article for color version of this figure.) 



Filtering and Clustering of PSI-BLAST Results 

During the PSI-BLAST searches, 369,118 hits were 
generated, representing a total of 144,941 distinct pro- 
tein sequences (Fig. 1). To deal with the differences in 
degree of conservation and family size between gene 
families, we deployed an iterative six-step filtering 
scheme that optimizes the applied filtering criteria and 
the selected PSI-BLAST iteration for each query se- 
quence individually. The most stringent step (6), de- 
manding at least 45% sequence identity and 300 amino 
acids in alignment length, was designed to reduce 
domain-derived superfamilies to family or subfamily 
level. Smaller and more diverse superkingdom- 
spanning families were handled via the least stringent 
step (1), allowing hits from the fringe of the "twilight 
zone" (Rost, 1999) with at least 25% sequence identity 
and 50-amino acid alignment length. The four interme- 
diate steps (see "Materials and Methods" for details) 
were designed to assess conservation grades between 
these two extremes. In total, 115,593 hits (31%) passed 
the filtering procedure. The majority of sequences 
(90%) were filtered by steps 3 to 6 (step 3: 21%; step 4: 
23%; step 5: 19%; step 6: 27%). For most of the queries 
(79%), results from the first PSI-BLAST iteration were 
preferred in order to avoid potential false-positive hits. 
Overlapping filtered result sets were merged to recover 
family relations by single linkage clustering using a 
stringent hit-coverage-based distance measure. The 
resulting 540 clusters contained 60,504 cluster mem- 
bers, representing 52,764 distinct sequences (Fig. 1). 
Because some clusters represent different yet possibly 
overlapping parts of one and the same gene family (see 
section "Cluster Annotation"), individual sequences 
can be part of more than one cluster, as indicated by an 
overlap of 12.8% among the clusters. On average, the 
clusters contain 112 members, with the largest cluster 
containing 1,182 and the smallest 21 members. The 
filtering and clustering procedure was developed and 
tested using queries derived from 93 previously deter- 
mined gene families of different function covering 
algae, moss, and seed plants. In this test case, 91 fami- 
lies were recovered as expected, whereas two gene 
families were merged. Inspection of the merged cluster 
revealed that the two families are indeed part of a larger 
subfamily (ATPases). Therefore, the filtering and clus- 
tering procedure is able to recover family structure with 
good performance. 
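
The two extreme filter steps and the merging of overlapping result sets can be sketched in Python as follows. This is an illustration under stated assumptions, not the TreePipe implementation: the `Hit` record, the function names, and especially the overlap criterion `min_shared_fraction` are hypothetical. The text only states that a stringent hit-coverage-based distance measure was used, and the four intermediate filter steps are not reproduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    query: str       # seed query identifier
    subject: str     # identifier of the matched protein
    identity: float  # percent sequence identity of the alignment
    aln_length: int  # alignment length in amino acids


# Only the two extremes are given in the text (percent identity, length).
LEAST_STRINGENT = (25.0, 50)   # step 1
MOST_STRINGENT = (45.0, 300)   # step 6


def filter_hits(hits, threshold):
    """Keep hits that reach both the identity and the length threshold."""
    min_identity, min_length = threshold
    return [h for h in hits
            if h.identity >= min_identity and h.aln_length >= min_length]


def single_linkage(result_sets, min_shared_fraction=0.5):
    """Merge result sets (query -> set of subjects) by single linkage.

    Two sets are linked when their shared members, relative to the
    smaller set, reach min_shared_fraction (an assumed criterion).
    """
    names = list(result_sets)
    parent = {name: name for name in names}

    def find(name):
        # Union-find lookup with path halving.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            set_a, set_b = result_sets[a], result_sets[b]
            smaller = min(len(set_a), len(set_b))
            if smaller and len(set_a & set_b) / smaller >= min_shared_fraction:
                parent[find(a)] = find(b)

    clusters = {}
    for name in names:
        clusters.setdefault(find(name), set()).update(result_sets[name])
    return list(clusters.values())
```

Because a single link is enough to join two result sets, a sequence found by several related queries ends up in one cluster, which matches the merging behavior described for overlapping filtered result sets.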

Redundancy Removal and Homology Reduction 

While it greatly improves taxon sampling, the strat- 
egy to use both a huge multispecies-containing data- 
base like UniProt and the individual whole-genome 
protein predictions results in the detection of identical 
protein sequences from these overlapping databases. In 
addition, the same locus is often represented by more 
than one protein sequence due to divergent predicted 
gene models, splice variants, as well as sequencing and 
annotation errors. To cope with this problem, redundant 
copies of genes were eliminated prior to all functional 
analyses using an identity cutoff of ≥99% for 
sequences of the same species. The total number of 
cluster members was thus reduced by 30%, resulting in 
42,133 total sequences, 37,247 of which are distinct. 
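
A minimal sketch of such a within-species redundancy filter is given below. It assumes a greedy strategy that keeps the longest sequence as representative; this strategy, the function names, and the ungapped `naive_identity` stand-in are assumptions for illustration, since the text specifies only the 99% cutoff and the restriction to sequences of the same species.

```python
def naive_identity(seq_a, seq_b):
    """Percent identical positions over the shorter sequence (ungapped).

    Stand-in for a proper alignment-based identity; illustration only.
    """
    n = min(len(seq_a), len(seq_b))
    if n == 0:
        return 0.0
    return 100.0 * sum(x == y for x, y in zip(seq_a, seq_b)) / n


def remove_redundant(members, identity=naive_identity, cutoff=99.0):
    """Drop sequences that are >= cutoff identical to a kept sequence
    of the same species.

    members: list of (seq_id, species, sequence) tuples.
    Longer sequences are visited first, so they become representatives.
    """
    kept = []
    for seq_id, species, seq in sorted(members, key=lambda m: -len(m[2])):
        redundant = any(
            species == kept_species and identity(seq, kept_seq) >= cutoff
            for _, kept_species, kept_seq in kept
        )
        if not redundant:
            kept.append((seq_id, species, seq))
    return kept
```

Identical sequences from different species are deliberately retained, because they represent distinct loci rather than database duplicates.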

In addition, a homology-reduced set of the 540 clus- 
ters was compiled to infer phylogenies (Fig. 1). Phylo- 
genetic inference of large clusters is computationally 
costly, and the interpretation and inference of results 
from huge trees is difficult. As a total of 102 clusters had 
more than 150 members, these were condensed via 
stepwise homology reduction until the threshold of 150 
members was reached. The homology-reduced clusters 
contain 29,317 cluster members in total, 26,595 of which 
are distinct. The average pairwise distances within 
clusters were found to be in the range of 12% to 95% 
identity with an average of 44%. 
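
One way to picture a stepwise homology reduction is to remove, one at a time, a member of the currently most similar pair until the size threshold is met. The sketch below follows that idea; the pair selection rule, the tie-breaking, and the ungapped identity stand-in are assumptions, as the procedure itself is described only in "Materials and Methods".

```python
def naive_identity(seq_a, seq_b):
    """Percent identical positions over the shorter sequence (ungapped)."""
    n = min(len(seq_a), len(seq_b))
    if n == 0:
        return 0.0
    return 100.0 * sum(x == y for x, y in zip(seq_a, seq_b)) / n


def homology_reduce(members, identity=naive_identity, max_members=150):
    """Shrink a cluster to at most max_members sequences (max_members >= 1).

    members: dict mapping sequence id -> sequence.
    In each step the most similar pair is located and its shorter
    member is dropped (on equal length, the second id of the pair).
    """
    kept = dict(members)
    while len(kept) > max_members:
        ids = sorted(kept)
        best_pair, best_identity = None, -1.0
        for i, a in enumerate(ids):
            for b in ids[i + 1:]:
                value = identity(kept[a], kept[b])
                if value > best_identity:
                    best_pair, best_identity = (a, b), value
        a, b = best_pair
        drop = a if len(kept[a]) < len(kept[b]) else b
        del kept[drop]
    return kept
```

The exhaustive pair search is quadratic per step and is meant only to make the logic explicit; a production pipeline would reuse a precomputed distance matrix.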

Multiple Alignments and Selection of Conserved Sites 

Due to errors introduced by the alignment algorithm, 
a certain fraction of columns in a multiple sequence 
alignment (MSA) generates noise that disturbs correct 
inference of phylogenetic relationships (Castresana, 
2000; Rosenberg, 2005). Such positions are usually 
removed manually in the course of a phylogenetic analysis. 
While current approaches to automated phylogenies 
(Sicheritz-Ponten and Andersson, 2001; Fuellen 
et al., 2003; Frickey and Lupas, 2004; Gouret et al., 
2005) mostly rely on unprocessed ClustalW alignments, 
we placed more emphasis on the alignment quality to 
increase the reliability of the resulting phylogenies. 
Thus, we used a measure that describes evolutionarily 
informative sites. We implemented a best-of-two approach, 
during which two alignments were (1) calculated 
using different state-of-the-art algorithms and (2) 
filtered using the sum-of-pairs score. In the next step, the 
alignment with the maximum number of remaining 
columns was chosen (Fig. 1). On average, gaps made up 
65% of the alignments, which were reduced to 28% of the 
original alignment length by applying this procedure. In 
71% of the cases, the MAFFT G-INSI (Katoh et al., 2005) 
alignment was selected to represent the cluster, whereas 
ProbCons (Do et al., 2005) or Muscle (Edgar, 2004) was 
chosen for 29% of the clusters. 
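
The best-of-two selection can be illustrated with a small Python sketch. The column score used here (fraction of identical, non-gap residue pairs) and the `min_score` threshold are assumptions chosen for clarity; the actual scoring and column removal threshold used in TreePipe are not given in this section.

```python
from itertools import combinations


def sum_of_pairs(column):
    """Fraction of residue pairs in a column that are identical.

    Pairs involving a gap ('-') never count as a match (assumed scoring).
    """
    pairs = list(combinations(column, 2))
    if not pairs:
        return 0.0
    matches = sum(1 for x, y in pairs if x == y and x != "-")
    return matches / len(pairs)


def conserved_columns(alignment, min_score=0.5):
    """Indices of columns whose score reaches min_score."""
    return [i for i, column in enumerate(zip(*alignment))
            if sum_of_pairs(column) >= min_score]


def best_of_two(aln_a, aln_b, min_score=0.5):
    """Filter both alignments and return the one retaining more columns."""
    kept_a = conserved_columns(aln_a, min_score)
    kept_b = conserved_columns(aln_b, min_score)
    if len(kept_a) >= len(kept_b):
        chosen, kept = aln_a, kept_a
    else:
        chosen, kept = aln_b, kept_b
    return ["".join(row[i] for i in kept) for row in chosen]
```

Each alignment is given as a list of equal-length strings, one per sequence. Choosing by the number of surviving columns favors the alignment that places more residues into confidently homologous positions.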

Automated Reconstruction of High-Quality Phylogenies 

Many approaches to phylogenomics rely solely on a 
distance approach using neighbor joining (NJ; Saitou 
and Nei, 1987). However, NJ is known to be susceptible 
to noisy data, provides no confidence measures, and 
makes it hard to compute reliable distances for strongly 
divergent sequences. Probabilistic approaches, such as 
maximum likelihood (ML) and Bayesian methods, are 
known to overcome most of these problems, but both 
are very time consuming and thus usually not applied 
in large-scale phylogenomics approaches. We followed 
a combined approach by calculating ML consensus 
branch lengths using gamma-distributed rates from 
bootstrapped NJ topologies (Fig. 1). We compared 
published phylogenies of plant TAP families to those 
created by the approach presented here. In general, the 
same topology was recovered and the same conclu- 
sions could be drawn from the automatically generated 
phylogenies described here. For example, homologs of 
the floral regulator LFY, a plant-specific TF, are present 
in all land plants. The LFY phylogeny is characterized 
by two deep clefts separating (1) angiosperms from 
gymnosperms and ferns and (2) mosses from gymno- 
sperms and ferns (Maizel et al., 2005). The same can be 
seen in the automatically generated PlanTAPDB LFY 
tree (TF037, accessible via the Web interface), while the 
increased taxonomic sampling of the cluster presented 
here even results in higher resolution of the phylogeny. 
As another example, both the automatically generated 
Retinoblastoma family tree, TR030, and a published 
phylogeny (Sabelli and Larkins, 2006) reveal lineage- 
specific expansion of this gene family in grasses. Phy- 
logenetic trees of gene families can be utilized to 
analyze the evolution of a gene of interest, to discover 
orthologs, and to aid functional gene annotation. While 
phylogenies have been published for several plant TF 
families (e.g. Theissen et al., 2000; Ichikawa et al., 2004; 
Maizel et al., 2005; Sabelli and Larkins, 2006; Shigyo 
et al., 2006), this study presents phylogenies with a 
dense taxon resolution for plant-anchored gene fami- 
lies and subfamilies not only of TFs but also of TRs and 
PTs. These data can in turn be applied as a tool for 
knowledge discovery. In addition to the phylogenies 
described above, which are based on the homology- 
reduced clusters, we also calculated initial phylogenies 
for the full clusters prior to the homology reduction 
step using bootstrapped NJ. 
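
The resampling that underlies bootstrapped NJ topologies can be sketched as follows. The example covers only the generation of bootstrap replicates and a simple distance matrix; tree construction by NJ and the ML estimation of branch lengths are left to dedicated phylogeny software. The uncorrected p-distance and all function names are assumptions for illustration and are not taken from the pipeline described here.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Resample alignment columns with replacement (one bootstrap replicate)."""
    n_cols = len(alignment[0])
    picks = [rng.randrange(n_cols) for _ in range(n_cols)]
    return ["".join(row[i] for i in picks) for row in alignment]


def p_distance(seq_a, seq_b):
    """Fraction of differing positions, ignoring positions with a gap."""
    compared = [(x, y) for x, y in zip(seq_a, seq_b) if x != "-" and y != "-"]
    if not compared:
        return 1.0
    return sum(x != y for x, y in compared) / len(compared)


def distance_matrix(alignment):
    """Symmetric matrix of pairwise p-distances for an alignment."""
    n = len(alignment)
    return [[p_distance(alignment[i], alignment[j]) for j in range(n)]
            for i in range(n)]


def bootstrap_matrices(alignment, replicates=100, seed=0):
    """Distance matrices for a set of bootstrap replicates."""
    rng = random.Random(seed)
    return [distance_matrix(bootstrap_alignment(alignment, rng))
            for _ in range(replicates)]
```

Each matrix returned by `bootstrap_matrices` would be passed to an NJ implementation, and the support of a branch is the fraction of replicate trees in which it appears.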

Cluster Annotation 

The functional annotation of the 540 candidate TAP 
clusters was inferred from identified InterPro domains 
and associated Gene Ontology (GO) terms (Camon 
et al., 2004) of the cluster members after redundancy 
removal. A total of 482 out of 540 clusters contained one 
or more InterPro domains with a relative occurrence of 
>80% among the nonredundant cluster members. 
While those were used for automated annotation, clus- 
ters with uncertain domain occurrence were manually 
checked and annotated. In total, only three clusters 
were composed of sequences from multiple unrelated 
TAP families. These large mixed clusters were formed 
due to shared DNA-binding or protein-protein inter- 
action domains (IPR001487 Bromodomain, IPR002110 
Ankyrin, IPR002713 FF domain) and were not further 
considered in this study. Members of 237 clusters are 
not directly associated with transcriptional regulation 
but function in related processes, such as DNA and 
RNA metabolism, and were also not further consid- 
ered. They derive from the loose initial query selection 
intended to include as many novel TAP families as 
possible. The vast majority (94%) of the remaining 300 
annotated TAP clusters (Fig. 1) contain sequences of 
single families or subfamilies. This confirms that the 
single-linkage clustering approach successfully formed 
clusters according to functional gene families and sub- 
families. In some cases (18 clusters), closely related 
(sub)families are represented by a single cluster due to 
shared domains or conserved regions. For instance, two 
types of regulators of the auxin response, ARF and 
Aux/IAA, form one cluster (TF007) due to the shared 
Aux/IAA-ARF dimerization domain. Likewise, very 
small or orphan TAP families are sometimes com- 
pletely submerged within clusters of related families 
(e.g. AN/TF002 and NPR1-like/TR023). Large gene 
families composed of several diverse subfamilies (e.g. 
C2C2/TF012-TF015) are sometimes represented by 
two types of overlapping PlanTAP clusters. They either 
provide a global view across the main family (C2C2 - 
GATA, CO-like, and Pseudo ARR-B/TF015) or exclu- 
sively span single subfamilies (C2C2 - CO-like/TF012; 
C2C2 - Dof/TF013; C2C2 - GATA/TF014). 
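
The automated part of the annotation, selecting domains that occur in more than 80% of the nonredundant cluster members, can be sketched as below. The data layout (a mapping from member identifier to a set of domain accessions) and the function name are assumptions; only the >80% criterion comes from the text.

```python
from collections import Counter


def consensus_domains(member_domains, min_fraction=0.8):
    """Domains present in more than min_fraction of the cluster members.

    member_domains: dict mapping member id -> set of domain accessions.
    Returns a sorted list; an empty list marks a cluster that would
    need manual inspection.
    """
    if not member_domains:
        return []
    counts = Counter(
        domain
        for domains in member_domains.values()
        for domain in set(domains)
    )
    n_members = len(member_domains)
    return sorted(domain for domain, count in counts.items()
                  if count / n_members > min_fraction)
```

A cluster for which this returns no domain corresponds to the cases with uncertain domain occurrence that were checked and annotated manually.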

TAP Gene Family Annotation 

TAP clusters with the same functional annotation 
(main and subfamily), which had not been merged 
during single linkage clustering due to the stringent 
parameters applied there, were manually grouped, 
resulting in 138 families of TAPs (Supplemental Table 
S1). This resulted in a total number of 14,680 nonredundant 
TAP family members, while the remaining 
overlap among the families was reduced to 3.6% 
(14,157 distinct nonredundant family members; Fig. 
1), indicating a good separation of the gene families. 
Fifty-four of the TAP families are represented by more 
than one cluster of deviating but partially overlapping 
composition. These multiple clusters depict the par- 
ticular TAP family either from a different taxonomic 
perspective (e.g. restricted to the plant lineage versus 
covering all kingdoms) or comprise different subfami- 
lies. Because large TAP gene families are substantially 
divergent outside of their conserved domains, it ap- 
pears more reasonable to deduce phylogenies from 
subgroups to be able to utilize as much homologous 
sequence information as possible. The phylogenetic 
trees were therefore derived for each of the 300 sep- 
arate TAP clusters. 
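
The overlap percentages quoted for clusters and families are consistent with a simple definition: the fraction of memberships that is not accounted for by distinct sequences. The function below is an assumption about how these values may have been derived, not the authors' stated formula; it merely reproduces the reported figures from the reported counts.

```python
def overlap_fraction(total_members, distinct_sequences):
    """Share of memberships that are repeat occurrences of a sequence."""
    return (total_members - distinct_sequences) / total_members


# Clusters after filtering: 60,504 members, 52,764 distinct sequences.
cluster_overlap = round(100 * overlap_fraction(60504, 52764), 1)

# Families after manual grouping: 14,680 members, 14,157 distinct sequences.
family_overlap = round(100 * overlap_fraction(14680, 14157), 1)

print(cluster_overlap, family_overlap)
```

Under this definition the cluster-level value comes out at 12.8% and the family-level value at 3.6%, in agreement with the numbers given in the text.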

We divided the TAP families into three categories 
according to their molecular function and associated 
GO terms: (1) DNA-binding TFs (59), which comprise 
direct activators or repressors of transcription; (2) TRs 
(56), comprising basal TFs interacting with RNA poly- 
merase II or the core promoter, coactivators/core- 
pressors, and chromatin remodeling factors; and (3) 
proteins with unknown function and/or domains that 
are possibly associated with transcriptional regulation 
(PT, 23; Fig. 1). 

Previously, plant TF gene families were globally identified 
in two seed plants, Arabidopsis and rice 
(http://arabtfdb.bio.uni-potsdam.de/v1.1/, http://ricetfdb.bio.uni-potsdam.de/v2.1/; 
Riechmann et al., 2000; 
Guo et al., 2005; Gao et al., 2006). Of the previously 
described TF families, just 14 are not present among the 
annotated families due to their absence from the P. 
patens candidate TAP set. However, eight of those 
(AS2/LOB, BES1, BZR, GeBP, GFR/ENBP, HRT-like, 
TCP, VOZ) could be identified in the whole-genome 
shotgun sequences produced by the U.S. Department 
of Energy Joint Genome Institute (http://www.jgi.doe.gov/), 
which became available recently, i.e. they 
were not covered by the clustered EST database used 
for query compilation. This confirms earlier estimates 
(Rensing et al., 2002) and shows that the EST data cover 
the P. patens transcriptome almost completely (in terms 
of TAP families, the coverage is 95%). For the other six 
missing TF families (C2C2-YABBY, NOZZLE [NZZ], 
PBF-2-like/Whirly, S1Fa-like, STERILE APETALA 
[SAP], ULTRAPETALA [ULT]), no homologs could be 
identified in the P. patens genomic traces. This might be 
due to the actual lack of these genes in the P. patens 
genome (which might also be a derived feature, i.e. 
secondary gene loss) or because differing rates of 
mutation fixation render detection using only homol- 
ogy searches impossible. Yet, the above-mentioned 
results demonstrate that using moss as an offset to 
identify a broad scope of plant TAPs is a valid approach, 
as only 4% of angiosperm TF families are 
unaccounted for. Furthermore, it provides evidence 
that the majority of flowering plant TF families can be 
tracked down to the basal land plant P. patens. The 
above-mentioned TF gene families that are absent from 
moss are all of small size and have specialized functions 
in flowering plants. They probably emerged after 
the evolutionary split of mosses and seed plants. The 
vegetative and reproductive development of flowering 
plants is entirely different from that of mosses, the life 
cycle of which is dominated by a haploid gametophytic 
phase. Mosses do not possess flowers, the organs for 
sexual reproduction of angiosperms. While mosses do 
contain homologs of some angiosperm (floral) home- 
otic genes, like KNOX (TF031) and MIKC-type MADS 
box (TF041), their function remains unclear (Theissen 
et al., 2001). On the other hand, NZZ, SAP, and ULT 
all play specific roles during development of flowers 
(Byzova et al., 1999; Schiefthaler et al., 1999; Carles et al., 
2005) and are absent from P. patens. The C2C2 zinc 
finger protein YABBY is expressed in a polar manner 
and specifies the abaxial identity of lateral organs of the 
Arabidopsis sporophyte (Siegfried et al., 1999), while 
the moss sporophyte is extensively reduced and possesses 
no lateral organs. Likewise, spinach (Spinacia 
oleracea) S1F mRNA accumulates in roots and etiolated 
seedlings (Zhou et al., 1995), while neither tissue is 
present as such in P. patens. Hence, absence of these 
specialized TF families from a basal land plant seems 
plausible. 

Coverage of Known TAP Families 

To analyze the level of completeness of our dataset, 
we compared numbers of PlanTAPDB family mem- 
bers with the size of well-known Arabidopsis TAP 
families. In Supplemental Table S2, those PlanTAP 
families that were previously described by Riechmann 
and colleagues (Riechmann et al., 2000) and/or are 
included in the current version (Version 2; July 2006) of 
DATF (Guo et al., 2005) are listed. To allow comparison 
of PlanTAPDB Arabidopsis members with these re- 
sources, only those member sequences corresponding 
to The Institute for Genomic Research (TIGR) Arabi- 
dopsis loci (loci themselves or those replaced by re- 
dundant UniProt sequences) were counted. The 
numbers shown were ascertained immediately after 
filtering and clustering, as well as after redundancy 
removal and homology reduction (Supplemental Ta- 
ble S2). Fortunately, the step of redundancy removal in 
no case accidentally reduced the number of detected 
Arabidopsis loci. As expected, the homology reduc- 
tion leads to a decrease in size of large families. The 
coverage of a minority of Arabidopsis TAP families by 
PlanTAPDB differs significantly due to possible anno- 
tation errors within the different resources (e.g. the 
C3H family, which probably also includes RNA- 
binding C3H zinc fingers). Taken together, the data 
illustrate that the PSI-BLAST approach is able to re- 
cover most of the members for the majority of gene 
families. However, especially in gene families with 
low sequence conservation apart from functional domains 
(e.g. MADS, HB), a significant number of family 
members might be missing. This depicts an inevitable 
shortcoming of this automated approach for the dis- 
covery of gene families. Nevertheless, on average the 
filtered and nonredundant Arabidopsis loci as present 
in PlanTAPDB cover 81% of the previously published 
gene family members. 

Web Interface 

The PlanTAPDB Web interface (http://www.cosmoss.org/bm/plantapdb) 
provides dynamic access to the 
results generated in this study. TAP gene families can 
be retrieved by their accession numbers and identifiers 
or queried via keyword searches among the family 
annotations. In addition, all 37,247 TAP cluster se- 
quences (Fig. 1) can be queried using BLAST. The 
PlanTAPDB portal gives an overview of all available 
families of TFs, TRs, and PTs in the form of grouped 
lists or a clickable image map of their overall taxo- 
nomic profile (described below). Both provide access 
to the PlanTAPDB family entry of interest via hyper- 
links. The family viewer displays the results of the 
comprehensive manual annotation process (main family, 
subfamily, consensus InterPro domains), as well as 
literature references and the list of annotated family 
members (including a graphical representation of their 
domain structure) for each of the 138 TAP families. The 
extensive information available for every member, e.g. 
InterPro domains and taxon information, is cross-linked 
to the primary databases. The individual taxonomic 
profile, as well as species names and several 
other parameters, can be used to filter the family 
member list. All member sequences can be retrieved 
selectively in FASTA format. The cluster(s) of which a 
PlanTAPDB family is composed can be accessed via 
links to the corresponding cluster view(s) and contain 
the following features: (1) the cluster's description and 
an optional comment that provides additional infor- 
mation derived from the manual annotation process; 
(2) the distance matrix and detailed statistics in the 
form of histograms and box plots, describing the 
cluster's sequence diversity as found in the redun- 
dancy removal and the homology reduction phase of 
TreePipe; (3) a graphical overview describing the distri- 
bution of the sum-of-pairs score, Shannon's entropy 
score, the gap ratio, and the column removal threshold 
along the length of the complete alignment used in the 
selection of conserved sites; (4) the initial alignment of 
all cluster members used to build the distance matrix 
as well as the filtered alignment, which was used to 
infer the phylogeny, viewable and downloadable via 
the Jalview applet (Clamp et al., 2004); and, finally, 
(5) the phylogenetic trees and various parameters 
describing the homology-reduced ML tree topolo- 
gies, which can be viewed and downloaded in 
New Hampshire/eXtended (http://phylogenomics.us/forester/NHX.html) 
format with bootstrap values and 
color-coded taxon information using the ATV applet 
(Zmasek and Eddy, 2001). The applet also allows by-node 
(group) retrieval of sequences displayed using 
the cosmoss sequence retrieval system (Lang et al., 
2005). Next to the button providing the homology- 
reduced topologies gained by the combined ML/NJ 
approach, two additional trees containing all cluster 
members (i.e. prior to the redundancy removal and 
homology reduction steps) can be viewed for each 
cluster of a TF or TR family. The first tree displays an 
unrooted NJ topology with bootstrap values and the 
second one a midpoint-rooted NJ topology with ML 
branch lengths. 

Different Expansion of TAP Gene Families among 
Algae and Plant Lineages 

Previous global comparative studies of plant TAP 
gene families focused mainly on the subgroup of 
DNA-binding TFs in seed plants (for review, see Qu 
and Zhu, 2006). On the basis of the PlanTAPDB data, we 
compared characteristics of plant TAP gene families 
across six species, for which genome-scale databases 
were queried during homolog detection. These included 
three algae, a moss, and two flowering plants 
to provide a broad evolutionary perspective. The total 
number of distinct TFs, TRs, and PTs of these species 
was extracted using the taxonomic annotation of the 
family members. The numbers of TFs detected by the 
approach presented here are smaller than previously 
published results for Arabidopsis, rice (Xiong et al.,
2005; Gao et al., 2006; Qu and Zhu, 2006), and C.
reinhardtii (http://chlamytfdb.bio.uni-potsdam.de/v2.0/),
which is due to the stringent filtering process
applied to prevent false-positive hits.

There seems to be a trend that total amounts of TAPs 
(Fig. 2) are associated with the number of cell types in 
the respective organism (there is no significant differ- 
ence between Arabidopsis and rice [P = 0.84], but P. 
patens differs significantly from both the flowering 
plants and the algae in this regard [P < 0.001]). A 
correlation of numbers of TFs with organism com- 
plexity (which might be defined as number of cell 
types) has previously been described for animals 
(Levine and Tjian, 2003). The low amount of TF genes 
in the three algae as compared with the three land 
plants (Fig. 2A, P < 0.001) coincides with reports for 
basal metazoans (the demosponge Reniera, the uro- 
chordate Ciona, the worm Caenorhabditis elegans, and 
the fly Drosophila melanogaster), which contain a much 
lower amount of TFs than mammals (Riechmann et al., 
2000; Reece-Hoyes et al., 2005; Satou and Satoh, 2005; 
Larroux et al., 2006). The fraction of TAPs per protein- 
coding genes in the respective genomes (Fig. 2B) 
depicts the same trend of association with the number 
of cell types. 

The gene family data (Fig. 3A) reveal an extensive 
(4.7-fold) increase in the number of different TF gene
families with the transition from the three algae (av- 
erage 12.0 ± 5.0 families) to the three land plants 
studied here (average 56.7 ± 0.6). The number of TR

Figure 2. Abundance of plant TAP genes. The absolute (A) and relative
(B) amounts of TAP genes in six species (see Supplemental Table S1 for 
abbreviations) are shown as bar charts and numerical values (C). The 
absolute gene numbers were inferred from the NCBI taxonomy infor- 
mation of all TAP family members. The relative abundance of TAPs is 
shown in relation to the total number of predicted proteins within the 
respective organism. TFs are shown in green, TRs in orange, and PTs in 
yellow. 



families exhibits the same trend (average 28.7 ± 4.0 
versus 54.0 ± 1.0), but less pronounced (1.9-fold), 
indicating an increased importance of TF genes for the 
evolution of the three land plants in question. Consis- 
tent with this, components of the basal transcriptional 
machinery and general TFs are known to be conserved 
across the three domains of life, while DNA-binding 
TFs have been shown to evolve in a lineage-specific 
way in plants as well as in animals (Coulson and 
Ouzounis, 2003; Gutierrez et al., 2004). A relationship 
between the increasing number of plant TAP families 
and the gain in morphological complexity has been 
hypothesized before (Doebley and Lukens, 1998; Hsia 
and McGinnis, 2003; Gutierrez et al., 2004). Addition- 
ally, because basal multicellular metazoans already 
contain most of the TF families present in mammals 
(Riechmann et al., 2000; Messina et al., 2004; Reece-Hoyes
et al., 2005; Larroux et al., 2006), which is not the
case for the comparison of algae and land plants as 
shown above, this explosion of gene family number 



might well be related to the switch from unicellularity 
to multicellularity. This theory is further supported by 
the fact that the fraction of human TAP families 
present in unicellular fungi is drastically reduced as 
compared to metazoans (Riechmann et al., 2000;
Coulson and Ouzounis, 2003; Messina et al., 2004). In 
concordance with this, nearly all of the different plant 
TAP gene families are already present in the basal land 
plant P. patens (Fig. 3A). However, in Arabidopsis and 
rice, the size (but not the number) of TAP gene families 
is significantly increased (Fig. 3B, P < 0.001), which 
might reflect the more complex body plan and spe- 
cialization of the two angiosperms as compared with 
P. patens. However, it should be noted that the differ- 
ences in average gene family size might also be due to 
inheritance from the respective last common ancestor 
and thus might not be related to morphological complexity.
Yet, lineage-specific expansion of gene families has been
studied before with respect to the evolution of eukaryotes
and was found to occur predominantly in plants, especially
in the case of TAP gene families (Lespinet et al.,
2002; Shiu et al., 2005).

Species-Specific Expansion of Individual TAP Families 

The absolute size of the 138 annotated TAP families
for the above-mentioned six species is shown in

Figure 3. Number and size of plant TAP families. Number (A) and
average gene family size (B) of TAP families in six species (see
Supplemental Table S1 for abbreviations) are shown. The average
gene family size was calculated as the ratio of absolute number of 
family members per number of families. TFs are shown in green, TRs in 
orange, and PTs in yellow.

Supplemental Table S1. The size distribution of the
Arabidopsis TF gene families correlates well with 
published results (Qu and Zhu, 2006), although the 
families are generally smaller due to the stringent 
elimination of false-positive hits applied in this study. 
The overall lineage-specific expansion of family num- 
ber and size is evident from Figure 3, as well as from 
the absolute values in Supplemental Table S1. On
average, TAP gene families are 2 to 3 times larger in P.
patens than in the three algae, while in Arabidopsis and 
rice the TR and PT families show an approximately 
4-fold increase and TF families a 9-fold increase as 
compared to the algae (Fig. 3). Consistent with this, 
recent comparative studies revealed that TAP family 
sizes in Arabidopsis and rice often expand with sim- 
ilar rates (Shiu et al., 2005; Xiong et al., 2005). We also 
determined gene families that were subject to individ- 
ual expansion against the background of lineage- 
specific evolution, i.e. families in which above-average 
expansion of distinct gene families per species occurred.
The data underlying Supplemental Table S1 for
each of the three TAP groups (TF, TR, and PT) were 
normalized using the respective total amount of genes 
per group, and significantly deviating families were 
highlighted by framing. In total, 29 families exhibit 
species-specific expansion, two of which are present in 
Arabidopsis, one in rice, 10 in P. patens, four in C. 
reinhardtii, nine in C. merolae, and three in T. pseudonana. 
Moss and the algae contain more specifically ex- 
panded TAP families (e.g. HIT and CONSTITUTIVE 
PHOTOMORPHOGENIC1 [COP1] in P. patens, PcG 
and SBP in C. reinhardtii, FHA and TFb2 in C. merolae,
DUF833 in T. pseudonana) than the two seed plants, 
which might be due to the fact that the overall expan- 
sion rate is less pronounced in the former organisms. 

As an example, members of a distinct branch of the 
His triad family (TF033, HIT) known from animals 
(Kijas et al., 2006) and fungi are only present in rice 
and moss. Interestingly, the human HIT protein Apra- 
taxin, which belongs to this family, has recently been 
shown to be involved in the protection against geno- 
toxic stress by interaction with proteins that are 
involved in DNA repair (Gueven et al., 2004). Apparently,
the ancestor of this particular gene was already
present in ancestral eukaryotes but has been lost in 
some plant and algal lineages. The P. patens Aprataxin- 
like protein might be involved as an upstream com- 
ponent of DNA mismatch repair (Trouiller et al., 2006) 
and thus might be related to the high efficiency of 
homologous recombination observed in the moss 
(Kamisugi et al., 2005). 

Taxonomic Distribution of Plant TAP Families across 
All Domains of Life 

For visualization of the distribution of TAP family 
members across all taxonomic lineages, a taxonomic 
profile was created and is presented as a heat map in 
Figure 4. Initial tests using taxonomic resolution fixed 
at the kingdom or order level, respectively, were not 



able to resolve the expected phylogeny of the contrib- 
uting taxa using columnwise clustering (data not 
shown). Therefore, those taxonomic groups that con- 
tributed significantly to the overall distribution were 
selected as columns; the remainder of the Eubacteria, 
protists, plants, and animals were gathered into the 
respective "other" columns. Thus, a nonredundant 
representation of the taxonomic distribution was cre- 
ated that is able to resolve the expected phylogeny 
using columnwise clustering. To overcome the sam- 
pling bias presented by fully sequenced genomes, the 
columns were normalized. Subsequent clustering 
yielded the significantly correlated groups depicted 
in Figure 4. The top half of the taxonomic profile 
contains families that are predominantly present in 
plants. Within these, the first significantly correlated 
cluster is almost completely composed of large plant 
TF families, most of which have been described as 
plant specific before (highlighted by green text color), 
while the second cluster contains a mixture of plant 
TAP families not yet discovered in Asterids. Only a 
few families, mostly TRs, are abundant in both pro- 
karyotes and eukaryotes (located mainly in the middle 
part of the profile). The families in the second half of 
the profile are shared between plants and other eu- 
karyotes and are sometimes present in Eubacteria and 
Archaea as well. The TR families accumulate within 
these clusters, especially in the lowest part. This dis- 
tribution correlates very well with published data 
(Riechmann et al., 2000; Coulson et al., 2001; Coulson 
and Ouzounis, 2003) and indicates that TFs often fulfill 
lineage- or kingdom-specific functions, while basal com- 
ponents of transcriptional regulation are conserved 
across different eukaryotic kingdoms or sometimes 
even across the primary domains of life. The profile 
gives a good impression about the distribution of 
certain families or clusters of families among taxo- 
nomic groups. It can be applied to narrow down the 
probable function of PTs, such as PT007 (DUF296 and 
HMG DNA-binding domain containing), which is 
located in the topmost significantly correlated cluster 
that is mainly composed of plant-specific TFs. The 
hypothesis that PT007 might represent a novel TF 
family is fortified by the domain structure of the mem- 
bers, most of which contain the two PFAM (Finn et al., 
2006) domains AT_hook (PF02178) and DUF296 
(PF03479), which are known to be present in this par- 
ticular order in a class of proteins that is thought to 
have DNA-binding activity. Overexpression of a pro- 
tein containing DUF296 led to late flowering and mod- 
ified leaf development in Arabidopsis (Weigel et al., 
2000). In addition, the taxonomic profile can be em- 
ployed to reveal families with biased profiles, which 
point at interesting evolutionary differences. As an 
example, among the plant-specific upper part there are 
clusters that, besides mosses and seed plants, contain 
sequences from "other," i.e. nonphotosynthetic, pro- 
tists, such as PT020 (TPR and Ankyrin domain con- 
taining; Fig. 4). A closer look reveals that the cluster 
contains sequences from the kinetoplastid parasites

Figure 4. Taxonomic profile of the plant TAP families. The NCBI
taxonomy information for all family members was parsed per annotated
TAP family. The columns represent those taxonomic groups that
contributed significantly to the distribution; the remainder of the
Eubacteria, protists, plants, and animals are represented as "other,"
respectively. After normalization of the columns (log odds ratio), the
rows were clustered and visualized as a heat map (yellow = overrepresented,
blue = underrepresented, black = average representation,
gray = missing). In the case of overrepresentation and underrepresentation,
the color intensity increases with rising distance from zero. All
clusters with a centered Pearson correlation coefficient R ≥ 0.7 are

Trypanosoma cruzi and Leishmania major. While the
Alveolata, such as the malaria parasite Plasmodium, 
harbor the remnant of a plastid, the so-called apico- 
plast (Waller and McFadden, 2005), this is not the case 
for the Kinetoplastida. Yet, they belong to the photo- 
synthetic Euglenozoa, and, based on some plant-like 
nuclear genes, it has been argued that they have under- 
gone secondary loss of a plastid (Hannaert et al., 2003), 
which coincides nicely with the data from cluster 
PT020. 



The WUSCHEL/WOX Phylogeny 

The HB/WUSCHEL (WUS) family (TF032_373) ex- 
hibits a rigorous land plant-specific taxonomic profile, 
comprising the species Arabidopsis, tomato (Solanum
lycopersicum), poplar (Populus spp.), rice, and P. patens.
The consensus domains for this family are Homeobox
(IPR001356), Homeodomain_like (IPR009057), and
Homeodomain-rel (IPR012287). During redundancy 
filtering, 10 nearly identical sequences belonging to 
Arabidopsis, rice, and poplar were removed. The 
average identity between the remaining sequences is 
relatively low (36.26%); therefore, the alignment was 
reduced from an initial 950 columns to 167 columns 
that could be unequivocally aligned, comprising mainly 
the actual homeobox domain. Due to the low conser- 
vation grade of the WUS-related (WOX) gene family 
(e.g. 30.6% amino acid identity between Arabidopsis 
WOX9 and WOX14), several annotated homologs 
were not detected by the PSI-BLAST search and thus 
are missing from the above-mentioned phylogeny. To 
add those, all annotated Arabidopsis WUS/WOX sequences
were retrieved from Swiss-Prot. After retrieval
of the remainder of the sequences using the Plan- 
TAPDB Web interface, MSA and tree reconstruction 
were performed. The phylogeny is available via the 
Web interface as well, as an example for manually 
curated data to be added upon request. The resulting 
tree (Fig. 5) is clearly separated into two clusters, one 
containing Arabidopsis WUS itself as well as the 
majority of WOX sequences, and the other containing 
Arabidopsis WOX 10, 13, and 14. While WUS has been 
shown to be involved in shoot meristem maintenance 
(Mayer et al., 1998; Leibfried et al., 2005; Kieffer et al.,
2006), the role of the other members of the gene family 
is not well defined yet, although some of the genes are 
involved in early embryonic cell fate decisions (Haecker 
et al., 2004). Given the deep cleft in the phylogenetic 
tree, the ancestral WUS/WOX gene probably had 
already acquired a paralog in the last common ancestor 
of all land plants. However, because P. patens homo- 
logs are exclusively present in the cluster containing 
the WOX 10, 13, and 14 homologs from Arabidopsis, 



displayed in color to the left of the heat map. The PlanTAPDB family

Figure 5. Phylogeny of the WUS/WOX gene family. A manually expanded
phylogeny of the PlanTAPDB WUS family (TF032_337) is presented. The
ML tree is shown with quartet support values at the nodes. The protein
sequences are represented by species name and accession number. [See
online article for color version of this figure.]



the paralog later giving rise to WUS was probably lost 
from the moss lineage after the divergence from the 
flowering plants. While involvement in stem cell 
maintenance and early embryo development has 
been described for WUS and several WOX gene pro- 
ducts, this is not the case for WOX 10, 13, and 14 
(Haecker et al., 2004). Therefore, the function of this 
retained ancestral WOX homeobox TF subfamily re- 
mains enigmatic at present. 

The COP1 Phylogeny 

The three uppermost clusters of the taxonomic 
profile (Fig. 4) contain families that are generally 
present in plants and also appear erratically in other 
taxonomic groups. Among those, the PT family PT024 
(COP1) can be found. It attracts attention because of 
the overrepresentation of moss sequences that is ap- 
parent from both the taxonomic profile (Fig. 4) and the 
species-specific expansion (Supplemental Table SI), 
which is in contrast to the generally lower amount of P. 
patens TAPs as compared to rice and Arabidopsis (Figs. 
2 and 3). In angiosperms, the E3 ubiquitin ligase
COP1 acts as a photomorphogenesis/skotomorpho- 
genesis switch by degradation of downstream factors 
in the dark, while it is inactivated by nuclear depletion 
in the light (Holm and Deng, 1999). In mammals, the 
homolog has been suggested to be involved in tumor- 
igenesis and stress response (Yi and Deng, 2005). In 
Arabidopsis, a single COP1 gene is present that com- 
prises several WD40 domains and an N-terminal 
RING domain. Consequently, the PT024 family was 
annotated using the domains WD40 (IPR001680)/WD40_like
(IPR011046) and Znf_RING (IPR001841).



Through redundancy removal the cluster members 
were reduced from 37 to 26, the redundant sequences 
originating from Arabidopsis, tomato, pea, and rice. 
The family is well conserved (average 61.39%) with 
even the rat homolog sharing 43.53% amino acid 
sequence identity with Arabidopsis COP1. The tree 
(Fig. 6) is clearly divided into two parts. The lower 
subtree contains most of the plant sequences, includ- 
ing Arabidopsis COP1 and several orthologs from 
monocots and dicots. Surprisingly, this cluster also 
contains a total of 11 P. patens sequences. While all the 
seed plant proteins in this cluster contain RING do- 
mains, this is not true for any of the moss sequences. 
The proteins in the upper subtree, containing some 
plant sequences as well as the rat and Dictyostelium 
homologs, do not contain RING domains, with the excep- 
tion of the Brassica and Dictyostelium sequences. The 
Arabidopsis proteins present in this part of the phy- 
logeny are SPA (suppressor of phytochrome A) pro- 
teins, which are dimerization partners of COP1 
(Laubinger et al., 2006). Because the P. patens data are 
based on clustered ESTs, it is possible that too many 
homologs are present in the tree and that the se- 
quences are fragmentary. Therefore, we analyzed the 
genomic situation by detecting and clustering all ho- 
mologs within the whole-genome shotgun sequence 
data available via http://www.cosmoss.org. This 
analysis revealed that a total of nine COP1 homologs 
are present in the genome, all of which contain a RING 
domain (which was missing from the virtual tran- 
scripts because of incomplete EST data). The genomic 
sequences are covered by the 11 virtual transcripts 
present in the tree. We also detected an additional SPA 
homolog that lacks the RING domain (data not

Figure 6. Phylogeny of the COP1/SPA gene family. The automatically
generated PlanTAPDB phylogeny of the COP1 family (PT024) is presented.
The consensus tree of 100 bootstrapped NJ trees with ML branch lengths
is shown; bootstrap values are shown at the nodes. The protein sequences
are represented by species name and accession number. [See online
article for color version of this figure.]

shown). Hence, P. patens seems to have acquired and
retained several COP1 paralogs during evolution. 
While mosses are not able to etiolate, they certainly 
do possess photomorphogenesis and harbor a full 
complement of photoreceptors (Bierfreund et al., 2004; 
Ichikawa et al., 2004; Kasahara et al., 2004; Mittmann 
et al., 2004; Uenaka et al., 2005). Still, the expansion of 
this particular gene family is puzzling. It has been 
demonstrated recently in Arabidopsis that COP1, but
not the SPA proteins, is involved in UV-B tolerance by
coordination of ELONGATED HYPOCOTYL5 (HY5)-controlled
as well as other pathways (Oravecz et al.,
2006). The closest homologs of Arabidopsis HY5 are 
present in cluster TF011_518, which belongs to family 
TF011 (bZIP). The proteins are well conserved (iden- 
tity 65.68%) and contain a single moss ortholog. 
Therefore, expansion of COP1 downstream factors is 
not apparent in moss. However, maybe the plethora of 
P. patens COP1 proteins aids in acquiring UV tolerance, 
a process that has been associated with pigment 
changes, e.g. in an Antarctic moss (Newsham, 2003). 



Caveats 

PlanTAPDB users should be aware that the auto- 
mated homolog detection and clustering approach 
resulted in the loss of some gene families, i.e. a low 
percentage (approximately 4%) of plant TAP families 
is missing. In addition, on average 19% of the gene 
family members known from well-annotated genomes 
are lacking. To present phylogenetic trees that can be 
viewed on a normal computer screen, large gene 
families have been reduced to contain a maximum of 
150 homology-condensed members. Due to the fragmentary
nature of the data (incomplete genome/transcriptome
data, fragmentary sequences, sampling bias),
the phylogenetic analyses might be biased or flawed. 
Taken together, users should take appropriate caution 
concerning the points raised above while interpreting 
the data. 

Potential Uses 

The PlanTAPDB resource might be used as a starting 
point for knowledge discovery. Using the family and 
cluster annotation available through the Web interface, 
designated gene families can be located, e.g. by name 
or member sequence accession number. MSAs of the 
gene families as well as arbitrary sequence subsets can 
be retrieved. The taxonomic profile (Fig. 4, also avail- 
able via the Web interface) and the overrepresentation 
analysis (Supplemental Table SI) might be employed 
to detect biased taxonomic distribution. Descriptive 
data, such as sequence conservation, gene family size, 
species distribution, and alignment properties, are avail- 
able. Cross-links to sequence, domain, and literature 
databases enable simple access to related information. 
Finally, the phylogenetic trees offer an evolutionary 
vantage point for nonexperts. 



CONCLUSION 

So far, most comparative analyses dealing with 
plant TAPs have focused on TFs of Arabidopsis and 
rice. To broaden our evolutionary understanding of 
transcriptional regulation in plants, we have included 
three algae and a moss into the present analysis, as 
well as the complete UniProt database. In addition, we
have analyzed both TFs and TRs, and have detected
several novel PT families. Using automated methods, 
a stringent detection and representation of gene clus- 
ters has been established that can easily be expanded 
to cover more genomes in the future, while manual 
curation of gene clusters into families assures their 
quality. High-quality phylogenetic trees were created 
from these clusters and are available through an easy- 
to-use Web interface together with a multitude of ac- 
companying data, such as alignments, domain-based 
family annotation, and taxonomic profiles. Instant 
knowledge discovery using the PlanTAPDB is straight- 
forward, as has been demonstrated using several 
examples. In addition, such comparative data can be 
applied to aid phylogenomics. 

The general expansion of both the total number of 
TAP genes and the amount of TAP families seems to 
coincide with organism complexity. A dramatic in- 
crease in the complexity of transcriptional regulation, 
particularly at the level of TFs, might have occurred 
after the development of multicellularity or, respectively,
the transition from water to land. Subsequently, during
land plant evolution, the intricacy of the previously
established TF families increased again, possibly
reflecting large-scale morphological and physiological 
changes paralleling angiosperm radiation. Apart from 
these general trends, distinct TAP gene families were 
subject to expansion in individual species. Interesting 
details about the evolution of the stem cell regulator 
WUS, the photomorphogenesis switch COP1, and the 
genotoxic stress-related HIT gene family were re- 
vealed. 



MATERIALS AND METHODS 
Sequence Datasets 

For the identification of Physcomitrella patens transcription-associated EST 
sequences, National Center for Biotechnology Information (NCBI) Entrez (Geer 
and Sayers, 2003) was utilized to query GenPept (Benton, 1990) Release 141. The 
Arabidopsis Information Resource (TAIR; Rhee et al., 2003) resources were 
searched via keyword. GenPept Release 151 and the TIGR Arabidopsis 
(Arabidopsis thaliana) and rice (Oryza sativa) predicted proteins (see below)
were used for the closest homolog determination. For the collection of homologs 
throughout the available protein space using PSI-BLAST, the UniProt Knowledgebase
Release 7.1 (http://www.ebi.uniprot.org/database/download.shtml)
was used. In addition, the following organism-specific protein databases were 
included. Arabidopsis: 28,952 predicted proteins, TIGR ATH1.pep 01/04
(ftp://ftp.tigr.org/pub/data/Eukaryotic_Projects/a_thaliana/annotation_dbs/ATH1.pep).
Rice: 88,149 predicted proteins, TIGR OSA1.pep 04/04
(ftp://ftp.tigr.org/pub/data/Eukaryotic_Projects/o_sativa/annotation_dbs/pseudomolecules/version_2.0/).
P. patens: ESTs were clustered and assembled according to Lang et al. (2005),
http://www.cosmoss.org Release 03/04. For the resulting virtual transcripts, ORFs
were predicted using FrameD (Schiex et al., 2003) and ESTscan 2.0 (Iseli et al., 1999)
with P. patens-specific models, yielding a total of 52,458 ORFs.
Thalassiosira pseudonana: 11,397 predicted proteins from Release 1.0, Department
of Energy Joint Genome Institute (http://genome.jgi-psf.org/thaps1/thaps1.download.ftp).
Cyanidioschyzon merolae: 5,013 translated mRNAs, Release 11/04
(http://merolae.biol.s.u-tokyo.ac.jp/download/cds_nt.fasta). Chlamydomonas
reinhardtii: 19,832 predicted proteins from Release 2.0, Department of Energy
Joint Genome Institute (http://genome.jgi-psf.org/chlre2/chlre2.download.ftp).
For the calculation of Figure 2B, the recently corrected number of protein-coding
genes for rice, 30,000, has been used (Itoh et al., 2007), and the estimated
number of 25,000 protein-coding genes for P. patens (Rensing et al., 2002; Lang
et al., 2005).



Software 

The results and resources presented here were generated using an automated
phylogeny pipeline that utilizes BLAST and PSI-BLAST (Altschul et al.,
1997), InterProScan 4.2 (Quevillon et al., 2005), EMBOSS 3.0.0 (Rice et al., 2000),
MAFFT 5.8 (Katoh et al., 2005), ProbCons 1.1 (Do et al., 2005), Muscle 3.52 
(Edgar, 2004), Phylip 3.65 (Felsenstein, 1989), Tree-Puzzle 5.2 (Schmidt et al., 
2002), a modified version of the puzzleboot script (http://www.tree-puzzle.de/ 
puzzlebootREADME.txt), and the PostgreSQL 8.0.8 (http://www.postgresql.org)
relational database. This so-called TreePipe is able to construct phylogenetic
trees for large datasets without manual interference and is implemented
with Perl 5.8.7 (http://www.perl.com), SQL, and shell scripts, making use of
the Bioperl CVS "live" branch (Stajich et al., 2002) and the Bio::Phylo Version
0.09 (http://search.cpan.org/~rvosa/Bio-Phylo-0.09/) packages. The extensive
data that are collected throughout the pipeline are stored in a relational
database schema developed for this project, called TreePipeDB. The PlanTAPDB
Web interface is implemented using mod_perl 2.0 (http://perl.apache.org/)
and Javascript with the TreePipeDB as backend. For the interactive
exploration of MSA and phylogenetic trees, we integrated the Jalview
multiple alignment editor 2.08.1 (http://www.jalview.org/) and ATV phylo- 
genetic tree viewer 2.0 BETA (http://www.phylogenomics.us/atv/) Java 
applets. 

Identification of the TAP Query Set 

NCBI GenBank was queried using the keywords "transcription factor,"
"transcription activator," "transcription repressor," and "transcription regulator,"
as well as taxon IDs of Viridiplantae and nongreen algae (txids 33090,
136419, 3027, 33682, 38254, 2830, 2763, 33634). Additionally, Arabidopsis loci
were extracted from TAIR matching the keyword "transcription factor." With
this reference set of 7,476 TAPs, the clustered P. patens EST sequences were
searched by TBLASTN. A total of 286 PFAM HMM profiles and 67 PROSITE
patterns of transcription-associated domains without taxonomic restriction 
were used for motif searches in the same database. A total of 1,592 nonredun- 
dant P. patens candidate TAP sequences were identified. Full-length closest 
homologs of the 1,592 moss candidate TAP transcripts were determined via 
BLASTX (Altschul et al., 1997) with an E-value cutoff of 1E-3 against GenPept 
and the TIGR Arabidopsis and rice predicted protein databases. The resulting 
hits were filtered using an alignment length and percent identity threshold of 50 
amino acids and 25%, respectively. 

PSI-BLAST Searches and Filtering of the Results 

PSI-BLAST searches were performed against the UniProt Knowledgebase, 
all available whole-genome predicted protein databases of plants and algae, 
and the predicted ORFs of the P. patens virtual transcripts using an E-value
threshold of 1E-4, a profile inclusion threshold of 1E-5, and four iterations. Up 
to 500 results per query were considered and parsed into the TreePipeDB. 
Each result set (composed of one query and its hits after one of the four PSI- 
BLAST iterations) was run through a series of six filter steps with increasing 
stringency concerning the length and percent identity of the PSI-BLAST 
matches (step 1: 25% identity/50-amino acid alignment length; step 2: 30%/60
amino acids; step 3: 35%/80 amino acids; step 4: 45%/100 amino acids; step 5:
45%/150 amino acids; step 6: 45%/300-amino acid length). For each query and
iteration, the filtering process determines the first filtering step that reduces
the result set to ≤50 and >5 members. Afterward, the optimal iteration (plus
determined filtering step) is chosen for each query, using a set of sequentially 
applied criteria: (1) the most stringent possible filtering step, (2) the maximal 
number of remaining sequences, and (3) the lowest iteration step (in order to 
select result sets with low amounts of false-positive hits). 
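The filter cascade and the choice of the optimal iteration can be sketched as follows. This is a minimal illustration, not the TreePipe code: the `Hit` record, the function names, and the reading of the size window as "more than 5 and at most 50 members" are assumptions made for the example, as is the interpretation of "most stringent possible filtering step" as the highest step number.

```python
from dataclasses import dataclass

# (minimum percent identity, minimum alignment length in amino acids)
# for filter steps 1..6, as listed in the text.
FILTER_STEPS = [(25, 50), (30, 60), (35, 80), (45, 100), (45, 150), (45, 300)]


@dataclass(frozen=True)
class Hit:
    """Hypothetical record for one PSI-BLAST match."""
    accession: str
    identity: float   # percent identity
    aln_length: int   # alignment length in amino acids


def apply_step(hits, step):
    """Keep only hits that satisfy the thresholds of the given step (1-based)."""
    min_identity, min_length = FILTER_STEPS[step - 1]
    return [h for h in hits
            if h.identity >= min_identity and h.aln_length >= min_length]


def first_fitting_step(hits, max_size=50, min_size=5):
    """Return (step, kept_hits) for the first step that leaves more than
    min_size and at most max_size hits, or None if no step fits."""
    for step in range(1, len(FILTER_STEPS) + 1):
        kept = apply_step(hits, step)
        if min_size < len(kept) <= max_size:
            return step, kept
    return None


def choose_iteration(hits_by_iteration):
    """Apply the three criteria in order: most stringent step, then the
    largest remaining set, then the lowest iteration number."""
    candidates = []
    for iteration, hits in hits_by_iteration.items():
        result = first_fitting_step(hits)
        if result is not None:
            step, kept = result
            candidates.append((-step, -len(kept), iteration, kept))
    if not candidates:
        return None
    neg_step, _, iteration, kept = min(candidates, key=lambda c: c[:3])
    return iteration, -neg_step, kept
```

Sorting on the tuple (negated step, negated set size, iteration) applies the three criteria sequentially, so a later criterion only breaks ties left by the earlier ones.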

Clustering of the Filtered Result Sets 

Single-linkage clustering using a stringent hit-coverage-based distance 
measure was implemented in Perl and the TreePipeDB backend. Result sets of 
two queries were merged if they shared at least one hit covering the same 
region of this hit sequence. The length of the region to be shared depends on 
the previously selected filter step, namely, the most stringent filter step
possible. For example, suppose result set A overlaps with B on hit X, where A was
filtered using step 6 and B using step 5. A and B can then only be merged into a
cluster if they overlap by at least 300 amino acids (step 6 criterion) on sequence X.
Result sets without any significant overlaps were added as single-query clusters. For
all cluster members, the corresponding NCBI taxonomy annotation was
retrieved and stored in TreePipeDB. 
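The merging rule can be illustrated with a small union-find sketch. The data layout (a dictionary per query holding its filter step and the covered region on each hit) and all function names are hypothetical; the original was implemented in Perl on top of TreePipeDB, and only the rule itself (shared hit, overlap at least as long as the minimum length of the more stringent step) is taken from the text.

```python
from itertools import combinations

# Minimum alignment length (amino acids) required by filter steps 1..6.
STEP_MIN_LENGTH = {1: 50, 2: 60, 3: 80, 4: 100, 5: 150, 6: 300}


def overlap_length(a, b):
    """Length of the overlap of two inclusive (start, end) intervals."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]) + 1)


def can_merge(set_a, set_b):
    """Two result sets merge if they cover the same region of at least one
    shared hit; the required length comes from the more stringent step."""
    required = STEP_MIN_LENGTH[max(set_a["step"], set_b["step"])]
    shared = set_a["hits"].keys() & set_b["hits"].keys()
    return any(
        overlap_length(set_a["hits"][h], set_b["hits"][h]) >= required
        for h in shared
    )


def single_linkage(result_sets):
    """Group query names into clusters by transitive merging (union-find)."""
    parent = {q: q for q in result_sets}

    def find(q):
        while parent[q] != q:
            parent[q] = parent[parent[q]]  # path compression
            q = parent[q]
        return q

    for qa, qb in combinations(result_sets, 2):
        if can_merge(result_sets[qa], result_sets[qb]):
            parent[find(qa)] = find(qb)

    clusters = {}
    for q in result_sets:
        clusters.setdefault(find(q), set()).add(q)
    return sorted(sorted(c) for c in clusters.values())


# Example from the text: A (step 6) and B (step 5) overlap on hit X.
example = {
    "A": {"step": 6, "hits": {"X": (1, 400)}},
    "B": {"step": 5, "hits": {"X": (50, 380), "Y": (1, 200)}},
    "C": {"step": 5, "hits": {"Y": (100, 180)}},
    "D": {"step": 1, "hits": {"Z": (1, 60)}},
}
clusters = single_linkage(example)
```

In this toy input, A and B share 331 residues on X and are merged, whereas B and C share only 81 residues on Y and stay apart; D has no shared hit and forms a single-query cluster.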

Redundancy Removal and Homology Reduction 

For the removal of redundant sequences, a MSA was performed using MAFFT 
FFT-NS-2 and pairwise distances were calculated using the EMBOSS distmat 
program. This alignment was used to infer initial phylogenies of the complete 
clusters. The resulting matrix was scanned for sequence pairs from the same 
species with a distance ≤1 substitutions per 100 amino acids. For each pair, one
representative was selected based on the originating database (UniProt sequences 
were preferred), sequence length, and lexical sort order of the accession number. 
The procedure was implemented in Perl using several Bioperl modules, including 
a modified version of the Bio::Tools::Run::Alignment::MAFFT module. For 
the parsing of the distmat distance matrices, an object-oriented Bioperl module 
(Bio::Matrix::IO::distmat) was written. Homology reduction was implemented 
in the same program but follows a different strategy. Beginning with 1 substitution 
per 100 amino acids and heuristically increasing this distance threshold, the 
distance matrix is iteratively scanned for sequence pairs with the respective 
distance, regardless of their species. The iteration stops when the remaining 
representative cluster members reach a given limit (150 sequences). 
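The two reduction strategies can be sketched as follows. This is an illustrative sketch, not the original Perl program. All names are hypothetical, and several details are assumptions: longer sequences and lexically smaller accession numbers are preferred, pairs are processed in order of increasing distance, and the threshold is raised by a fixed increment (the text only states that it is increased heuristically).

```python
from typing import NamedTuple


class Seq(NamedTuple):
    accession: str
    species: str
    database: str
    length: int


def preference(seq):
    """Sort key: UniProt first, then longer sequences, then accession order."""
    return (seq.database != "UniProt", -seq.length, seq.accession)


def reduce_pairs(seqs, dist, threshold, same_species_only):
    """Drop the less preferred member of each pair with distance <= threshold.

    dist maps (accession_a, accession_b) to substitutions per 100 amino acids.
    """
    kept = {s.accession: s for s in seqs}
    for (a, b), d in sorted(dist.items(), key=lambda item: item[1]):
        if d > threshold or a not in kept or b not in kept:
            continue
        if same_species_only and kept[a].species != kept[b].species:
            continue
        loser = max(kept[a], kept[b], key=preference)
        del kept[loser.accession]
    return list(kept.values())


def homology_reduction(seqs, dist, limit=150, start=1.0, increment=1.0):
    """Raise the threshold until at most `limit` representatives remain."""
    threshold = start
    max_distance = max(dist.values(), default=0.0)
    while len(seqs) > limit and threshold <= max_distance:
        seqs = reduce_pairs(seqs, dist, threshold, same_species_only=False)
        threshold += increment
    return seqs
```

Redundancy removal corresponds to `reduce_pairs(seqs, dist, 1.0, same_species_only=True)`; homology reduction then operates on the remaining sequences regardless of species.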

Multiple Alignments and Selection of Informative Sites 

Multiple alignments for a given cluster were performed using MAFFT 
G-INS-i and ProbCons (clusters ≤150) or MUSCLE (clusters >150). Subsequently, 
sum-of-pairs scores using the BLOSUM62 substitution matrix, gap ratios, and 
Shannon's entropy scores (Valdar, 2002) were calculated and recorded 
columnwise in the TreePipeDB. Finally, columns below a sum-of-pairs score 
of -2 were excised from the alignment. The procedure was implemented in a 
Perl program, which, besides the filtering of a given MSA, also produced 
overview graphics of the different scores along the overall alignment. 
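Two of the columnwise scores, the gap ratio and Shannon's entropy, can be computed as in the following sketch. It assumes an alignment given as equal-length strings with "-" as the gap character and uses the plain Shannon entropy in bits; the normalization used in the original program may differ. The BLOSUM62 sum-of-pairs score is omitted here because it requires the substitution matrix. The function name `column_scores` is hypothetical.

```python
import math
from collections import Counter


def column_scores(alignment):
    """Return a (gap_ratio, entropy_in_bits) tuple for every alignment column."""
    scores = []
    for column in zip(*alignment):
        residues = [c for c in column if c != "-"]
        gap_ratio = 1 - len(residues) / len(column)
        entropy = 0.0
        for count in Counter(residues).values():
            p = count / len(residues)
            entropy += p * math.log2(1 / p)
        scores.append((gap_ratio, entropy))
    return scores
```

A fully conserved column has an entropy of 0, whereas a column in which all 20 amino acids are equally frequent reaches the maximum of log2(20), approximately 4.32 bits.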

Reconstruction of Phylogenies of the Representative 
Cluster Members 

Phylogenies for the representative cluster members were inferred using a 
Perl program on all clusters. After generation of 100 bootstrapped alignments 
using seqboot from the PHYLIP package, ML distance matrices were computed 
for these alignments using puzzleboot as implemented in Tree-Puzzle. These 
distance matrices were then used to infer topologies by applying the NJ 
algorithm as implemented in PHYLIP's neighbor program. Afterward, the 
resulting 100 trees were used to create a ML consensus topology using Tree- 
Puzzle. For the two steps where Tree-Puzzle was used to compute maximum 
likelihoods, eight gamma-distributed rates were used to model mutation rate 
heterogeneity and full (exact) ML parameter estimation was performed for each 
gene family. Manual ML trees were created using the same parameter settings. 
The WAG (Whelan and Goldman, 2001) evolutionary model of sequence 
evolution, which is derived from a database of globular proteins, was used. The 
resulting phylogenetic tree offered both an overall confidence value, i.e. the ML 
of the tree, and confidence values for every branch in the form of bootstrap 
values. Finally, the trees were parsed and midpoint-rooted via an additional 
Perl program that also collects a large variety of parameters from the tree 
topologies using both Bioperl and the Bio::Phylo modules (e.g. the longest 
internal branch, the Fiala stemminess [Fiala and Sokal, 1985], and the resolution) 
and writes them into the TreePipeDB. The initial phylogenies for the 
complete clusters were inferred in analogy to the procedure described above, 
using a Perl wrapper combining the PHYLIP tools seqboot and neighbor. 
However, in this case JTT distances (Jones et al., 1992) were calculated with 
PHYLIP's protdist and consensus trees with consense to cope with the runtime 
demands of clusters up to 1,182 members. Finally, the consensus topologies 
were used to estimate ML branch lengths with the user-tree option of Tree- 
Puzzle, using uniform rates and exact parameter estimation. 
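The bootstrapping step that precedes distance calculation resamples alignment columns with replacement. The following sketch illustrates only this principle; it does not call or reproduce seqboot, and the function name and fixed random seed are assumptions.

```python
import random


def bootstrap_alignments(alignment, replicates=100, seed=0):
    """Yield bootstrap replicates of an alignment (name -> aligned sequence).

    Each replicate has the original length and is built by drawing column
    indices with replacement; the same indices are used for every sequence.
    """
    rng = random.Random(seed)
    n_columns = len(next(iter(alignment.values())))
    for _ in range(replicates):
        columns = [rng.randrange(n_columns) for _ in range(n_columns)]
        yield {name: "".join(seq[i] for i in columns)
               for name, seq in alignment.items()}
```

Each replicate would then be passed to the distance calculation and NJ steps, and the resulting 100 topologies summarized in a consensus tree.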

Cluster and Gene Family Annotation 

The nonredundant cluster member sequences were annotated using InterProScan 
4.2 with all available databases of InterPro release 12.1. The 
annotated domains and associated GO terms were stored in the TreePipeDB. 
InterProScan searches (Quevillon et al., 2005) were performed for the 37,247 
distinct cluster members after redundancy removal. A total of 99.8% of the 
sequences could be annotated with InterPro domains. Sixty-two percent of the 
domains found were from the PANTHER (Mi et al., 2005), PFAM (Finn et al., 
2006), and PROSITE (Hulo et al., 2006) databases. Manual curation was 
performed by inspection of the description lines of the enclosed UniProt 
sequences and by inferring the classification of Arabidopsis cluster members 
from DATF (Guo et al., 2005) and ArabTFDB (http://arabtfdb.bio.uni-potsdam.de/v1.1/). 
To further assign thus far undetected TAP families, their corresponding 
Arabidopsis and rice members collected from DATF (Guo et al., 
2005), ArabTFDB (http://arabtfdb.bio.uni-potsdam.de/v1.1/), DRTF (Gao 
et al., 2006), and RiceTFDB (http://ricetfdb.bio.uni-potsdam.de/v2.1/) were 
used to screen the nonredundant cluster members for homologs by BLASTP. 

Species-Specific Expansion, Taxonomic Profiling, 
and Statistical Tests 

The PlanTAPDB family sizes in six genera, Arabidopsis, rice, P. patens, C. 
reinhardtii, C. merolae, and T. pseudonana, were inferred using the NCBI 
taxonomy information of the nonredundant list of family members. These 
values were normalized using the total amount of members per group (TF, TR, 
or PT) in order to account for the general differences in TAP family sizes. If the 
fraction of family members in a given species deviated from the arithmetic 
average of the group with a z score of ≥1.8, it was marked as expanded (no 
gene family was significantly reduced according to this criterion). The cutoff 
was chosen based on a distribution plot of all z scores (data not shown). 
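One possible reading of this criterion is sketched below. It assumes that, for each family, the normalized fractions of the six species are compared with their arithmetic mean and that the sample SD is used; the original calculation may differ in both respects. The function name and the example values used to call it are hypothetical.

```python
from statistics import mean, stdev


def expanded_species(family_sizes, group_totals, cutoff=1.8):
    """Return the species in which a family is marked as expanded.

    family_sizes: species -> number of members of this family
    group_totals: species -> total members of the family's group (TF, TR, or PT)
    """
    fractions = {sp: family_sizes[sp] / group_totals[sp] for sp in family_sizes}
    mu = mean(fractions.values())
    sd = stdev(fractions.values())  # sample SD (an assumption)
    if sd == 0:
        return []
    return [sp for sp, f in fractions.items() if (f - mu) / sd >= cutoff]
```

With six species and the sample SD, the largest attainable z score is 5/√6 (approximately 2.04), so under this reading a cutoff of 1.8 selects only families that are strongly overrepresented in a single species.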

For visualization of the taxonomic composition of the TAP families 
(taxonomic profile), all taxa were allocated into 20 nonredundant taxonomic 
groups that were chosen because they contributed significantly to the distri- 
bution of NCBI taxonomy strings. After normalization for taxonomic group 
size (columnwise log ratio per average), the rows were used for average- 
linkage clustering with a centered Pearson-correlation distance and heat map 
visualization using Cluster 3.0 and JavaTreeview 1.0.12 (Eisen et al., 1998). 

Hypothesized differences in the size distribution of TAP gene families 
between organisms (Fig. 3B) were tested using two-sided t tests assuming 
unequal variances. Fisher's exact test was used to test for hypothesized 
differences between the total numbers of genes of the six organisms (Fig. 2A). The 
resulting P values were adjusted for multiple testing by calculating the false 
discovery rate (Benjamini and Hochberg, 1995). 
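The Benjamini-Hochberg adjustment can be written compactly, as in the following sketch. It shows the standard step-up procedure and is not the code used for the original analysis; the function name is hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted P values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest to the smallest P value so that the adjusted
    # values stay monotone: adj(rank) = min over k >= rank of p(k) * m / k.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

For the four P values 0.01, 0.04, 0.03, and 0.20, the adjusted values are 0.04, 0.0533, 0.0533, and 0.20.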

Supplemental Data 

The following materials are available in the online version of this article. 

Supplemental Table S1. Plant TAP family sizes in algae, moss, and 
flowering plants. 

Supplemental Table S2. Coverage of known TAP families through 
PlanTAPDB. 